Home Credit Default Risk (HCDR)

The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition. The goal of this project is to predict whether or not a client will repay a loan. In order to make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

Some of the challenges:

  1. Dataset size
    • 688 MB compressed, with millions of rows of data
    • 2.71 GB uncompressed

Dataset and how to download

Background: Home Credit Group

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit Group

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Background on the dataset

Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.

The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would otherwise be unable to obtain loans or would fall victim to untrustworthy lenders.

Home Credit Group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 2018-05-19).


Data files overview

There are 7 different sources of data:

image.png

Downloading the files via Kaggle API

Create a base directory:

DATA_DIR = "../../../Data/home-credit-default-risk"   # same level as the course repo, in the data directory

Please download the project data files and data dictionary and unzip them using either of the following approaches:

  1. Click on the Download button on the following Data Webpage and unzip the zip file to DATA_DIR
  2. If you plan to use the Kaggle API, please use the following steps.

Imports

Data files overview

Data Dictionary

A Data Dictionary comes as part of the data download. It is named HomeCredit_columns_description.csv.

image.png

Application train

Application test

The application dataset has the most information about the client: gender, income, family status, education, and so on.

The other datasets

Exploratory Data Analysis

Summary of Application train

Numerical and Categorical Data

Correlation test

Check Skewness/Distribution for numerical data

Look for skewed columns in the numerical data, but ignore dates, day counts, flags, statuses, and IDs.

Skewness in: AMT_INCOME_TOTAL, AMT_CREDIT, AMT_ANNUITY

AMT_CREDIT and AMT_ANNUITY look skewed. We will apply a log transformation to these attributes to bring their distributions closer to normal.
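As a sketch of the transformation (with made-up, right-skewed toy values rather than the real application data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the skewed amount columns: lognormal draws
# are strictly positive and heavily right-skewed, like loan amounts.
rng = np.random.default_rng(0)
amounts = pd.DataFrame({
    "AMT_CREDIT": rng.lognormal(mean=12, sigma=1.0, size=10_000),
    "AMT_ANNUITY": rng.lognormal(mean=10, sigma=0.8, size=10_000),
})

# log1p (log(1 + x)) is safe even if a value is exactly 0.
logged = amounts.apply(np.log1p)

print(amounts.skew())   # strongly positive on the raw scale
print(logged.skew())    # near 0 after the transform
```

In the notebook the same `np.log1p` call would be applied to the real AMT_CREDIT and AMT_ANNUITY columns.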

AMT_CREDIT

AMT_ANNUITY

Missing data for application train

Distribution of the target column

This shows that around 8% of applicants are not able to repay their loans.
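The class balance can be checked with a one-liner; here is a sketch on a miniature hypothetical TARGET column (in the real application_train, TARGET == 1 marks clients with payment difficulties):

```python
import pandas as pd

# 92 repaid (0) vs 8 defaulted (1): an 8% positive rate, mimicking HCDR.
target = pd.Series([0] * 92 + [1] * 8, name="TARGET")

# normalize=True returns proportions instead of raw counts.
dist = target.value_counts(normalize=True)
print(dist)
```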

Correlation with the target column

Applicants Age

Applicants occupations

17% of our applicants are labourers and around 10% work in sales. This suggests that loan applicants tend to come from the lower income ranges.

Bureau

Bureau_balance

As per the normalized view of the tables provided above, bureau_balance links to bureau, which in turn links to the application train/test table.

As seen above, we are not getting much information out of the bureau_balance table. We might drop this table altogether, but that decision can wait until we select the best features via SelectKBest or another attribute-selection method.
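The two-step link described above (bureau_balance → bureau → application) can be sketched with tiny hypothetical frames; the key columns SK_ID_BUREAU and SK_ID_CURR follow the HCDR schema, but the values are made up:

```python
import pandas as pd

app = pd.DataFrame({"SK_ID_CURR": [1, 2]})
bureau = pd.DataFrame({"SK_ID_CURR": [1, 1, 2], "SK_ID_BUREAU": [10, 11, 20]})
bureau_balance = pd.DataFrame({
    "SK_ID_BUREAU": [10, 10, 11, 20],
    "MONTHS_BALANCE": [-1, -2, -1, -1],
})

# 1) Collapse bureau_balance to one row per SK_ID_BUREAU.
bb_agg = (bureau_balance.groupby("SK_ID_BUREAU")["MONTHS_BALANCE"]
          .agg(["count", "min"])
          .add_prefix("BB_")
          .reset_index())

# 2) Attach to bureau, then collapse bureau to one row per SK_ID_CURR.
bureau_feat = (bureau.merge(bb_agg, on="SK_ID_BUREAU", how="left")
               .groupby("SK_ID_CURR")[["BB_count", "BB_min"]]
               .mean()
               .add_prefix("BUREAU_")
               .reset_index())

# 3) Left-join onto the application table so no applicants are dropped.
app_joined = app.merge(bureau_feat, on="SK_ID_CURR", how="left")
print(app_joined)
```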

Credit card balance

Dataset questions

Unique record for each SK_ID_CURR

Previous applications for the submission file

The applicants in the Kaggle submission file appear in previous_application.csv: 47,800 out of 48,744 people have had previous applications.

Histogram of Number of previous applications for an ID


Can we differentiate applicants by low, medium, and high numbers of previous applications?
* Low = fewer than 5 previous applications (22%)
* Medium = 10 to 39 previous applications (58%)
* High = 40 or more previous applications (20%)
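Counting previous applications per applicant and banding them could be sketched as below; the frame and the cut points are illustrative, not the exact thresholds quoted above:

```python
import pandas as pd

# Hypothetical previous_application rows: one row per application.
prev = pd.DataFrame({"SK_ID_CURR": [1, 1, 1, 2, 2, 3]})

# One count per applicant.
counts = prev.groupby("SK_ID_CURR").size().rename("N_PREV_APPS")

# pd.cut assigns each count to a labeled interval (right-closed by default).
bands = pd.cut(counts, bins=[0, 2, 5, float("inf")],
               labels=["low", "medium", "high"])
print(bands)
```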

Joining secondary tables with the primary table

In the case of the HCDR competition (and many other machine learning problems that involve multiple tables, in 3NF or not), we need to join (denormalize) these datasets when using a machine learning pipeline. Joining the secondary tables with the primary table will yield many new features about each loan application; these features will tend to be aggregate-type features or metadata about the loan or its application. How can we do this when using machine learning pipelines?

Joining previous_application with application_x

We refer to the application_train data (and the application_test data) as the primary table and the other files as the secondary tables (e.g., the previous_application dataset). The secondary tables link back to the primary table via the key SK_ID_CURR (directly, or indirectly in the case of bureau_balance, which first joins to bureau via SK_ID_BUREAU).

Let's assume we wish to generate a feature based on previous application attempts. In this case, possible features here could be:

To build such features, we need to join the application_train data (and the application_test data) with the previous_application dataset (and the other available datasets).

When joining this data in the context of pipelines, different strategies come to mind with various tradeoffs:

  1. Preprocess each of the non-application data sets, thereby generating many new (derived) features, and then joining (aka merge) the results with the application_train data (the labeled dataset) and with the application_test data (the unlabeled submission dataset) prior to processing the data (in a train, valid, test partition) via your machine learning pipeline. [This approach is recommended for this HCDR competition. WHY?]

I want you to think about this section and build on this.

Roadmap for secondary table processing

  1. Transform all the secondary tables into features that can be joined into the main table, the application table (labeled and unlabeled):
    • 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments',
    • 'previous_application', 'POS_CASH_balance'
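The roadmap above can be sketched for one secondary table. The frames below are tiny hypothetical stand-ins for previous_application and application_train (column names follow the HCDR schema, values are made up):

```python
import pandas as pd

prev_app = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_APPLICATION": [1000.0, 3000.0, 500.0],
})
app_train = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})

# One row per applicant: summary statistics become new candidate features.
prev_feat = (prev_app.groupby("SK_ID_CURR")["AMT_APPLICATION"]
             .agg(["mean", "max", "count"])
             .add_prefix("PREV_AMT_APP_")
             .reset_index())

# Left-join so applicants with no previous applications are kept (NaN features).
merged = app_train.merge(prev_feat, on="SK_ID_CURR", how="left")
print(merged)
```

The same groupby/agg/merge pattern applies to each of the other secondary tables.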

agg detour

Aggregate using one or more operations over the specified axis.

For more details see agg

DataFrame.agg(func=None, axis=0, *args, **kwargs)
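A minimal illustration of DataFrame.agg on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# A single function name applies to every column.
totals = df.agg("sum")
print(totals)   # A -> 6, B -> 60

# A dict maps columns to lists of functions; unspecified cells are NaN.
out = df.agg({"A": ["min", "max"], "B": ["mean"]})
print(out)
```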

Multiple condition expressions in Pandas

So far, both our boolean selections have involved a single condition. You can, of course, have as many conditions as you would like. To do so, you will need to combine your boolean expressions with logical operators.

Although Python uses the keywords and, or, and not, these will not work when testing multiple conditions with pandas. The details of why are explained here.

You must instead use the following operators with pandas: & (and), | (or), and ~ (not).
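A small sketch of combining conditions (hypothetical values; note that each condition must be wrapped in parentheses, because &, |, and ~ bind more tightly than comparison operators):

```python
import pandas as pd

df = pd.DataFrame({"AMT_CREDIT": [100, 500, 900],
                   "CODE_GENDER": ["F", "M", "F"]})

# Elementwise boolean logic: & (and), ~ (not).
mask = (df["AMT_CREDIT"] > 200) & ~(df["CODE_GENDER"] == "M")
print(df[mask])   # only the third row matches
```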

Missing values in prevApps

Feature engineering for the prevApp table

Feature transformer for the prevApp table

Join the labeled dataset

Join the unlabeled dataset (i.e., the submission file)

Processing pipeline

OHE when previously unseen unique values appear in the test/validation set

Train, validation and Test sets (and the leakage problem we have mentioned previously):

Let's look at a small use case that shows how to deal with this:

This last problem can be solved by using the option handle_unknown='ignore' of the OneHotEncoder, which, as the name suggests, will ignore previously unseen values when transforming the test set.

Here is an example of that in action:

# Identify the categorical features we wish to consider.
cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE', 
               'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']

# Note handle_unknown="ignore" in the OneHotEncoder, which ignores values from the
# validation/test set that do NOT occur in the training set.
cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('imputer', SimpleImputer(strategy='most_frequent')),
        # In scikit-learn >= 1.2, use sparse_output=False instead of sparse=False.
        ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
    ])
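To see the option in isolation, here is a tiny self-contained sketch (toy arrays, not the HCDR data); an unseen category at transform time becomes an all-zeros row instead of raising an error:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["F"], ["M"], ["F"]])
valid = np.array([["M"], ["XNA"]])   # "XNA" never appears in training

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train)

# Row for "XNA" is [0, 0]: the unseen value is silently ignored.
out = ohe.transform(valid).toarray()
print(out)
```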

OHE case study: The breast cancer wisconsin dataset (classification)

Please see this blog for more details on OHE when the validation/test sets have previously unseen unique values.

HCDR preprocessing

Baseline Model

To get a baseline, we will use a subset of the features after they have been preprocessed through the pipeline. The baseline model is a logistic regression model.
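The overall shape of such a baseline could be sketched as below: a ColumnTransformer routing numerical and categorical columns through their own pipelines, feeding a LogisticRegression. The two columns and all values are hypothetical stand-ins, not the real training data:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({
    "AMT_CREDIT": [1000.0, np.nan, 3000.0, 2500.0],
    "CODE_GENDER": ["F", "M", np.nan, "F"],
})
y = [0, 1, 0, 1]

num_pipe = Pipeline([("impute", SimpleImputer(strategy="median")),
                     ("scale", StandardScaler())])
cat_pipe = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                     ("ohe", OneHotEncoder(handle_unknown="ignore"))])

pre = ColumnTransformer([("num", num_pipe, ["AMT_CREDIT"]),
                         ("cat", cat_pipe, ["CODE_GENDER"])])

model = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])
model.fit(X, y)

# Predicted probability of the positive class (TARGET == 1).
probs = model.predict_proba(X)[:, 1]
print(probs)
```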

Baseline with 14 features

Split train, validation and test sets

Baseline 2: All features

Baseline 3: 79 Features

Selected Features

Remove features with more than 50% null values

Baseline 4: 79 features, with 2 log-transformed features

Remove features with more than 50% null values, with log-transformed AMT_ANNUITY and AMT_CREDIT

Submission File Prep

For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. The file should contain a header and have the following format:

SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.
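Assembling that file from a frame of predicted probabilities could look like the sketch below (hypothetical IDs and probabilities; the output filename is illustrative):

```python
import io
import pandas as pd

# Hypothetical predicted probabilities for three test-set applicants.
submission = pd.DataFrame({
    "SK_ID_CURR": [100001, 100005, 100013],
    "TARGET": [0.1, 0.9, 0.2],
})

# In the notebook: submission.to_csv("submission.csv", index=False).
# A StringIO buffer is used here just to show the resulting text.
buf = io.StringIO()
submission.to_csv(buf, index=False)
csv_text = buf.getvalue()
print(csv_text)
```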

Report submission

Click on this link

image-2.png

Write-up

For this phase of the project, you will need to submit a write-up summarizing the work you did. The write-up form is available on Canvas (Modules-> Module 12.1 - Course Project - Home Credit Default Risk (HCDR)-> FP Phase 2 (HCDR) : write-up form ). It has the following sections:

Abstract

Our goal was to determine how accurately we can predict a new applicant's ability to repay a loan. Our focus during this phase was EDA, feature engineering, and building a baseline model toward that goal. [TODO: summarize the EDA findings.] After examining the correlations and adding the necessary features, we built a data pipeline for categorical as well as numerical values and fed it into a logistic regression model. Grid search (cv=5) found the best parameters to be C=100 with an 'l2' penalty. From this pipeline, we obtained an ROC AUC score of 73.57% with an accuracy of 91.58%. [TODO: write about the Kaggle submission.]

Introduction

Feature Engineering and transformers

Please explain the work you conducted on feature engineering and transformers. Please include code sections when necessary as well as images or any relevant material.

For feature engineering, we first found 2 skewed attributes among the numerical features and applied a log transformation to them; details are in the corresponding section above.

Pipelines

Please explain the pipelines you created for this project and how you used them. Please include code sections when necessary, as well as images or any relevant material.

All pipelines are prepared under section 7.

Our All Features model uses all 120 features available in the training set (except SK_ID_CURR and TARGET). There are also two models where we select the features to use: the first, Baseline with 14 attributes, uses 7 numerical and 7 categorical attributes; the second, Baseline with 79 attributes, keeps the features that have at most 50% null values in the training data. The last model, Baseline with log features, uses the same 79 features plus the 2 log-transformed attributes.

Numerical features were standardized using StandardScaler, with missing values filled using the median. Categorical features were one-hot encoded, with missing values filled using the most frequent value.

The ROC AUC score is used to evaluate our models.
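For reference, computing the metric is a single scikit-learn call; the labels and probabilities below are toy values, not our model's outputs:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_prob)
print(auc)   # 0.75: 3 of the 4 positive/negative pairs are ranked correctly
```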

We have created 4 logistic regression models:

* Baseline 1: all features (120 raw inputs)
* Baseline 2: selected features (14 inputs)
* Baseline 3: selected features (79 raw inputs)
* Baseline 4: selected features (79 raw inputs plus log features: log AMT_ANNUITY, log AMT_CREDIT)


image.png

image-3.png

Experimental results

Please present the results of the various experiments that you conducted. The results should be shown in a table or image. Try to include the different details for each experiment.

Please include code sections when necessary as well as images or any relevant material

image.png

Discussion

We ran 4 experiments, as shown in the attached image. Except for the model with 14 features, all models achieved a training accuracy of 92%. The best validation accuracy was achieved by models 3 and 4 (sections 7.4 and 7.5). For scoring, we used the AUC, which was 74.34% for the model with all attributes.

With more robust feature selection, encodings, loss functions, and grid-search parameters, we think the score could be improved substantially. We will build on our current understanding and incorporate these lessons to improve our results.

Conclusion

The focus of this project was to predict a client's ability to repay their loan. This is important for Home Credit in determining which clients will repay their loans.

The baseline models used logistic regression with a ColumnTransformer feeding separate numerical and categorical pipelines: numerical features were standardized and imputed with the median, while categorical features were one-hot encoded and imputed with the most frequent value. Feature engineering was also performed.

The hypothesis was that with logistic regression we would be able to make predictions with an AUC in the 70-75% range. Four baseline models were used, with 14, 120, 79, and 79-plus-log features, achieving test AUC scores of 0.7357, 0.7434, 0.7406, and 0.7405 respectively. The baseline model with all features had the highest test AUC score in the notebook (0.7434).

For the next phase, we will perform feature engineering on all the datasets to find important features, apply dimensionality reduction, and try an SVC to improve the score.

Kaggle Submission

Please provide a screenshot of your best kaggle submission.
The screenshot should show the different details of the submission and not just the score.

image-2.png

References

Some of the material in this notebook has been adapted from here.

TODO: Predicting Loan Repayment with Automated Feature Engineering in Featuretools

Read the following: